15 research outputs found
Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction
We tackle image question answering (ImageQA) problem by learning a
convolutional neural network (CNN) with a dynamic parameter layer whose weights
are determined adaptively based on questions. For the adaptive parameter
prediction, we employ a separate parameter prediction network, which consists
of a gated recurrent unit (GRU) taking a question as its input and a
fully-connected layer generating a set of candidate weights as its output.
However, it is challenging to construct a parameter prediction network for a
large number of parameters in the fully-connected dynamic parameter layer of
the CNN. We reduce the complexity of this problem by incorporating a hashing
technique, where the candidate weights given by the parameter prediction
network are selected using a predefined hash function to determine individual
weights in the dynamic parameter layer. The proposed network---the joint
network of the CNN for ImageQA and the parameter prediction network---is trained
end-to-end through back-propagation, where its weights are initialized using a
pre-trained CNN and GRU. The proposed algorithm achieves
state-of-the-art performance on all available public ImageQA benchmarks.
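The hashing step described above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the paper's implementation: the candidate weights would really come from the GRU-plus-fully-connected prediction network, and a fixed-seed random index table stands in for whatever predefined hash function maps each position of the dynamic weight matrix to a candidate weight.

```python
import numpy as np

def hashed_dynamic_layer(x, candidate_weights, out_dim, seed=0):
    """Apply a fully-connected dynamic parameter layer whose (out_dim x in_dim)
    weight matrix is filled from a small pool of candidate weights via a
    predefined hash (here: a fixed-seed index table), reducing the number of
    parameters the prediction network must output."""
    in_dim = x.shape[-1]
    k = candidate_weights.shape[0]
    rng = np.random.default_rng(seed)           # fixed seed = "predefined" hash
    idx = rng.integers(0, k, size=(out_dim, in_dim))
    W = candidate_weights[idx]                  # (out_dim, in_dim) drawn from k values
    return W @ x

# Toy usage: 8 candidate weights (in the paper, predicted from the question).
candidates = np.linspace(-1.0, 1.0, 8)
x = np.ones(5)                                  # stand-in for an image feature
y = hashed_dynamic_layer(x, candidates, out_dim=3)
print(y.shape)  # (3,)
```

Note that the prediction network only needs to emit the small candidate pool; the hash table itself is fixed before training, so gradients flow to the shared candidates.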
Reinforcing an Image Caption Generator Using Off-Line Human Feedback
Human ratings are currently the most accurate way to assess the quality of an
image captioning model, yet most often the only outcome used from an expensive
human-rating evaluation is a few overall statistics over the evaluation
dataset. In this paper, we show that the signal from instance-level human
caption ratings can be leveraged to improve captioning models, even when the
amount of caption ratings is several orders of magnitude less than the caption
training data. We employ a policy gradient method to maximize the human ratings
as rewards in an off-policy reinforcement learning setting, where policy
gradients are estimated by samples from a distribution that focuses on the
captions in a caption ratings dataset. Our empirical evidence indicates that
the proposed method learns to generalize the human raters' judgments to a
previously unseen set of images, as judged by a different set of human judges,
and additionally on a different, multi-dimensional side-by-side human
evaluation procedure.
Comment: AAAI 2020
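The off-policy estimator described above can be sketched as importance-weighted REINFORCE over a tiny softmax policy. This is a hedged toy, not the paper's captioning model: the "captions" are just four categorical choices, the ratings and the proposal probabilities `q` are made-up numbers, and the importance weight pi/q corrects for the fact that samples come from the rating-focused proposal rather than the current policy.

```python
import numpy as np

def offpolicy_pg_update(theta, samples, lr=0.1):
    """One off-policy REINFORCE step for a softmax policy over a small
    caption set. Each sample is (caption_id, rating, q), where q is the
    caption's probability under the proposal distribution concentrated on
    rated captions; the ratio pi/q is the importance weight."""
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    for cid, rating, q in samples:
        w = probs[cid] / q                  # importance weight pi(c)/q(c)
        glogp = -probs.copy()               # gradient of log softmax ...
        glogp[cid] += 1.0                   # ... w.r.t. the logits theta
        grad += w * rating * glogp          # rating acts as the reward
    return theta + lr * grad / len(samples)

theta = np.zeros(4)                         # logits over 4 candidate "captions"
rated = [(2, 1.0, 0.25),                    # well-rated caption
         (0, -0.5, 0.25)]                   # poorly rated caption
theta = offpolicy_pg_update(theta, rated)
print(theta[2] > theta[0])  # True: the well-rated caption gains probability
```

The key design point mirrored here is that only the handful of human-rated captions are ever sampled, which is what lets a small ratings set (orders of magnitude smaller than the caption training data) still provide a usable policy-gradient signal.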